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Classification and regression using an outer 
approximation projection-gradient method 

Michel Barlaud, Fellow, IEEE, Wafa Belhajali, Patrick L. Combettes Fellow, IEEE, and Lionel Fillatre 


Abstract —This paper deals with sparse feature selection and 
grouping for classiflcation and regression. The classification or 
regression problems under consideration consists in minimizing 
a convex empirical risk function subject to an E constraint, a 
pairwise l°° constraint, or a pairwise E constraint. Existing 
work, such as the Lasso formulation, has focused mainly on 
Lagranglan penalty approximations, which often require ad 
hoc or computationally expensive procedures to determine the 
penalization parameter. We depart from this approach and 
address the constrained problem directly via a splitting method. 
The structure of the method is that of the classical gradient- 
projection algorithm, which alternates a gradient step on the 
objective and a projection step onto the lower level set modeling 
the constraint. The novelty of our approach is that the projection 
step is implemented via an outer approximation scheme in which 
the constraint set is approximated by a sequence of simple convex 
sets consisting of the intersection of two half-spaces. Convergence 
of the iterates generated by the algorithm is established for a 
general smooth convex minimization problem with inequality 
constraints. Experiments on both synthetic and biological data 
show that our method outperforms penalty methods. 


I. Introduction 

In many classification and regression problems, the ob¬ 
jective is to select a sparse vector of relevant features. For 
example in biological applications, DNA microarray and new 
RNA-seq devices provide high dimensional gene expression 
(typically 20,000 genes). The challenge is to select the small¬ 
est number of genes (the so-called biomarkers) which are 
necessary to achieve accurate biological classification and 
prediction. A popular approach to recover sparse feature 
vectors (under a condition of mutual incoherence) is to solve 
a convex optimization problem involving a data fidelity term 
$ and the E norm |[6l, ifTTl . lfT9ll . l[34l . Recent Lasso penalty 
regularization methods take into account correlated data using 
either the pairwise E penalty ll24l . fFh . ll^ or the pairwise 
penalty 0 (see also 1201 for further developments). The 
sparsity or grouping constrained classification problem can be 
cast as the minimization of a smooth convex loss subject to 
an E or a pairwise constraint, say if{w) E tj. Most of 
the existing work has focused on Lagrangian penalty methods, 
which aim at solving the unconstrained problem of minimizing 
$ -I- X(p. Although, under proper qualification conditions, there 
is a formal equivalence between constrained and unconstrained 
Lagrangian formulations 0 Chapter 19], the exact Lagrange 
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multiplier A can seldom be computed easily, which leaves the 
properties of the resulting solutions loosely defined. The main 
contribution of the present paper is to propose an efficient 
splitting algorithm to solve the constrained formulation 

minimize ^(w) (1) 

directly. As discussed in fTTlI . the bound rj defining the 
constraint can often be determined from prior information on 
the type of problem at hand. Our splitting algorithm proceeds 
by alternating a gradient step on the smooth classification 
risk function $ and a projection onto the lower level set 
{w G I <f{w) ^ p}. The main focus is when ip models 
the E, pairwise E constraint, or pairwise constraint. The 
projection onto the lower level set is implemented via an outer 
projection procedure which consists of successive projections 
onto the intersection of two simple half-spaces. The remainder 
of the paper is organized as follows. Section |II] introduces the 
constrained optimization model. Section |III] presents our new 
splitting algorithm, which applies to any constrained smooth 
convex optimization problem. In particular we also discuss 
the application to regression problems. Section [W] presents 
experiments on both synthetic and real classical biological and 
genomics data base. 

II. Classification risk and convex constraints 
A. Risk minimization 

We assume that m samples in are available. 

Typically m < d, where d is the dimension of the feature 
vector. Each sample Xi is annotated with a label taking its 
value in {—1,-|-1}. The classification risk associated with a 
linear classifier parameterized by a vector w is given by 

^ m 

$: —>■ R: w I—>■ — 5^ I tu)). (2) 

m 

i=l 

We restrict our investigation to convex losses (j) which satisfy 
the following assumption. 

Assumption 1 Let /: R — >■ [0,1] be an increasing Lipschitz- 
continuous function which is antisymmetric with respect to the 
point (0,/(O)) = (0,1/2), integrable, and differentiable at 0 
with f'{0) = max/'. The loss R —>■ R A defined by 

(Vf G R) = —t + f f{s)ds. (3) 

J —OO 

The main advantage of this class of smooth losses is that it 
allows us to compute the posterior estimation lH. The function 
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/ relates a prediction {xi | m) of a sample Xi to the posteriori 
probability for the class +1 via 

P[Yi = +1 I Xi] = f{{xi I w)). (4) 

This property will be used in Section |IV] to compute without 
any approximation the area under the ROC curve (AUC). The 
loss (j) in Assumption [T] is convex, everywhere differentiable 
with a Lipschitz-continuous derivative, and it is twice differ¬ 
entiable at 0 with ^"(0) = max(/)". In turn, the function $ of 
(O is convex and differentiable, and its gradient 

^ m 

V-i-. w ^—'^f{{xi\w))xi (5) 

TTl . 

2=1 

has Lipschitz constant 

m m 

Applications to classification often involve normalized features. 
In this case, (O reduces to /3 = f'(0) = Examples of 

functions which satisfy Assumption [T] include that induced by 
the function f:t i—1/(1 -f exp(—f)), which leads to the 
logistic loss (/>: i I—>■ ln(l -|-exp(—i)), for which 0"(O) = 1/4. 
Another example is the Matsusita loss Il29l 

(j): fi—^ 2^~ ^ V^l + ) (7) 

which is induced by /: 1 1 —>■ (t/y/1 + + l) /2. 

B. Sparsity model 

In many applications, collecting a sufficient amount of 
features to perform prediction is a costly process. The chal¬ 
lenge is therefore to select the smallest number of features 
(genes or biomarkers) necessary for an efficient classification 
and prediction. The problem can be cast as a constrained 
optimization problem, namely. 


edge in the graph to be similar. We define the problem of 
minimizing under the directed acyclic graph constraint as 

minimize (10) 

for some suitable parameters p ^ 0. In the second, undirected 
graph, approach 13 one constrains the coefficients of features 
uJi and ujj connected by an edge using a pairwise £°° constraint. 
The problem is to 

minimize $(w). (11) 

lueR'' 

To approach the constrained problems ([3 and (fTOl l. state of the 
art methods employ a penalized variant ca, m, GD, Ea. 
In these Lagrangian approaches the objective is to minimize 
<I>-|-A(/3, where A > 0 aims at controlling sparsity and grouping, 
and where the constraints are defined by one of the following 
(see (la, ([Tol l, and (fTni ) 

(/?2: ^«^E(*,j)eS™ax(|a;i|,|wj|) (12) 

iP3--w^ E(*.j)es |wi - Wjl. 

The main drawback of current penalty formulations resides 
in the cost associated with the reliable computation of the 
Lagrange multiplier A using homotopy algorithms Ca, ED, 
ll22l . |[28ll . The worst complexity case is 0(3"^) ESI , which is 
usually intractable on real data. Although experiments using 
homotopy algorithms suggest that the actual complexity is 
0{d) EH1> the underlying path algorithm remains computa¬ 
tionally expensive for high-dimensional data sets such as the 
genomic data set. To circumvent this computational issue, we 
propose a new general algorithm to solve either the sparse (0) 
or the grouping (ITOl) constrained convex optimization problems 
directly. 


minimize 4>(ui), (8) 

tueR'^ 

where ||w||o is the number of nonzero entries of w. Since 
II • II 0 is not convex, ([3 is usually intractable and an alternative 
approach is to use the norm || • ||i as a surrogate, which yields 
the Lasso formulation |[34l 

minimize 4>(ui). (9) 

tueR'^ 

It has been shown in the context of compressed sensing that 
under a so-called restricted isometry property, minimizing with 
the II • 111 norm is tantamount to minimizing with the || • ||o 
penalty in a sense made precise in 13 ■ 

C. Grouping model 

Let us consider the graph S of connected features 
The basic idea is to introduce constraints on the coefficients 
for features uji and Wj connected by an edge in the graph. 
In this paper we consider two approaches: directed acyclic 
graph and undirected graph. Lused Lasso EH encourages the 
coefficients Wi and Wj of features i and j connected by an 


D. Optimization model 

Our classification minimization problem is formally cast as 
follows. 


Problem 1 Suppose that satisfies Assumption Q] and let 
ip : ^ R /jg convex. Set 


$: 


^ m 

^ I w)) 


(13) 


and 



(14) 


and let be the Lipschitz constant o/V4>, as defined in ( 13 - 
The problem is to 


minimize $(w). (15) 

w^C 

In Section IIVI we shall focus on the three Instances of 
the function p defined in (fT2l) . We assume throughout that 
there exists some p G R such that {a: G C | 4)(a:) ^ p} is 
nonempty and bounded, which guarantees that (fTSl) has at least 
one solution. In particular, this is true if T* or tp is coercive. 
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III. Splitting algorithm 

In this section, we propose an algorithm for solving con¬ 
strained classification problem O. This algorithm fits in 
the general category of forward-backward splitting methods, 
which have been popular since their introduction in data 
processing problem in IfT^ : see also ifO . lfT3l . If30l . 1321, 1(3^ . 
These methods offer flexible implementations with guaranteed 
convergence of the sequence of iterates they generate, a key 
property to ensure the reliability of our variational classifica¬ 
tion scheme. 

A. General framework 

As noted in Section [HI $ is a differentiable convex function 
and its gradient has Lipschitz constant /3, where /3 is given by 
®. Likewise, since (p is convex and continuous, C is a closed 
convex set as a lower level set of p. The principle of a splitting 
method is to use the constituents of the problems, here $ and 
C, separately. In the problem at hand, it is natural to use the 
projection-gradient method to solve (fTSI l. This method, which 
is an instance of the proximal forward-backward algorithm 
Ga, alternates a gradient step on the objective $ and a 
projection step onto the constraint set C. It is applicable in 
the following setting, which captures Problem [1] 

Problem 2 Let $: —?> 'K be a differentiable convex 

function such that is Lipschitz-continuous with constant 
13 G ]0, let p\ —>■ R iie a convex function, let t] gR, 
and set C = {ru G R'^ | p(vj) ^ p}. The problem is to 

minimize $(rt;). (16) 

w^C 

Let Pc denote the projection operator onto the closed 
convex set C. Given wq G R'^, a sequence ( 7 „)„gN of strictly 
positive parameters, and a sequence (a„)„gN in modeling 

computational errors in the implementation of the projection 

operator Pc, the projection-gradient algorithm for solving 
Problem |2] assumes the form 

forn = 0 , 1, ... 

Vn=Wn- 7nV$(u;„) (17) 

_ Wn+1 = Pc{Vn) + a„. 

We derive at once from lfT4l Theorem 3.4(i)] the following 
convergence result, which guarantees the convergence of the 
iterates. 

Theorem 1 Suppose that Problem^has at least one solution, 
let Wq G R'^, let (7„)„gN be a sequence in ]0, and let 

(onjngN be a sequence in R'^ such that 

_^ 2 

||a„|| <-foo, inf 7 „ > 0, and sup 7 „ <-. (18) 

^ neN P 

Then the sequence (re„)„gN generated by dni) converges to a 
solution to Problem |2] 

Theorem [T| states that the whole sequence of iterates con¬ 
verges to a solution. Using classical results on the asymptotic 
behavior of the projection-gradient method l!25l , we can com¬ 
plement this result with the following upper bound on the rate 
of convergence of the objective value. 



Fig. 1: A generic iteration for computing the projection of 
Po onto C. At iteration k, the current iterate is pk and C 
is contained in the half-space H{pQ,pk), onto which pk is 
the projection of po (see (l22l) '). If p{pk) > p, a subgradient 
vector Sk G dp{pk) is in the normal cone to the lower 
level set {p € R'^ | p(p) ^ p(pfe)} at pk, and the subgra¬ 
dient projection Pk+1/2 of pk is defined by (|23]) : it is the 
projection of pk onto the half-space H{pk,Pk+i/ 2 ) which 
contains C. The update Pk+i is the projection of po onto 
H{Po,Pk) <3 H{Pk,Pk+1/2)■ 


Proposition 1 In Theorem [7] suppose that (Vn G N) a„ = 0. 
Then — inf $((7) ^ i?/n -f 1 for some i? > 0. 

The implementation of (fTTI l is straightforward except for the 
computation of Pc{vn)- Indeed, C is defined in (fTTIl as the 
lower level set of a convex function, and no explicit formula 
exists for computing the projection onto such a set in general 
0 Section 29.5]. Fortunately, Theorem [T] asserts that Pc{vn) 
need not be computed exactly. Next, we provide an efficient 
algorithm to compute an approximate projection onto C. 

B. Projection onto a lower level set 

Let Po G R"^, let p: R'^ —>■ R be a convex function, and let 
p G R be such that 

C = {p G R"^ I <p(p) ^ p} 0. (19) 

The objective is to compute iteratively the projection Pc{po) 
of Po onto C. The principle of the algorithm is to replace this 
(usually intractable) projection by a sequence of projections 
onto simple outer approximations to C consisting of the 
intersection of two affine half-spaces ifTOl . 

We first recall that s G R"^ is called a subgradient of p at 
p G R'^ if 0 Chapter 16] 

(Vp G R”^) (p - p I s) + p{p) ^ p(p)- (20) 

The set of all subgradients of p at p is denoted by dp{p). If p 
is differentiable at p, this set reduces to a single vector, namely 
the gradient S/p{p). The projection PciPo) of po onto C is 
characterized by 

fPc(po) G C 

\ (Vp gC) {p- Pc{po) I Po - Pc{po)) < 0. 
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Given x and y in define a closed affine half-space H{x, y) 
by 

H{x,y) = {peR^ \ {p-y\x-y) ^0}. (22) 


Note that H{x,x) = and, if x ^ y, H{x,y) is the 
closed affine half-space onto which the projection of x is 
y. According to (|2TI) . C C H{po,Pc{po)). The principle of 
the algorithm is as follows (see Fig. [T]). At iteration k, if 
pipk) ^ f], then pk G C and the algorithm terminates with 
Pk = Pc{po)- Indeed, since C C H{pQ,pk) HO) Section 5.2] 
and Pk is the projection of po onto H{po,pk), we have 
\\po-Pk\\ < \\po - Pc{po)\\- Hence pk G C ^ pk = PciPo), 
i.e., (p{pk) ^ Pk = Pc{Po)- Otherwise, one first computes 
the so-called subgradient projection of pk onto C. Recall that, 
given Sk G d(p{pk), the subgradient projection of pk onto C 

is 11 , 0 , (9| 


Pk+l/2 — 



V - PjPk) 


-Sk 


if (p{pk) > rj 
if p{Pk) < V- 


(23) 


As noted in la, the closed half-space H{pk,Pk+i/ 2 ) serves 
as an outer approximation to C at iteration k, i.e., C C 
ff(pk,Pk+i/ 2 ); moreover Pk i C ^ pk i H{pk,Pk+i/2)- 
Thus, since we have also seen that C C H{po,pk), we have 


C C Cfc, where Ck = H{po,Pk) fi H{pk,Pk+i/ 2 )- (24) 

The update pk+i is computed as the projection of pq onto the 
outer approximation Ck- As the following lemma from Il2l 
(see also 0 Corollary 29.25]) shows, this computation is 
straightforward. 


Lemma 1 Let x, y, and z be points in R'^ such that 


H{x,y)r H{y,z) ^ 0. (25) 


Moreover, set a = x — y, b = y — z, x = {o- \ b), p = ||a|P, 
u = ll&lp, and p = pv — x^- Then the projection of x onto 
H{x,y)nH{y,z) is 


if p = 0 and x^O 


Q(x,y,z) = < X - if p>f)andxvf^ P 

y+-{xa-pb) if p>0 andxv<P■ 
P 


(26) 


To sum up, the projection of po onto the set C of (fT9l ) will 
be performed by executing the following routine. 


for fc = 0,1,... 
if (p{pk) ^ r] 

[terminate. 

Ck=V- piPk) (27) 

Sk G d(p{pk) 

Pk+l/2 =Pk+ CfcSfc/l|sfc||^ 

_ Pk+l = Q{P0,Pk,Pk+l/2)- 

The next result from ifTOl Section 6.5] guarantees the conver¬ 
gence of the sequence {pk)k+n generated by (IZTi l to the desired 
point. 


Proposition 2 Let pq G R'^, let p: R be a convex func¬ 

tion, and let rj gR be such that C = {p G R^ | pip) ^ v} ^ 
0. Then either (O terminates in a finite number of iterations 
at PciPo) or it generates an infinite sequence {pk)k+n such 
that Pk PciPo)- 

To obtain an implementable version of the conceptual 
algorithm 03, consider its nth iteration and the computation 
of the approximate projection w-n+i of onto C using IZTl) . 
We first initialize IZTl) with pq = Vn, and then execute only 
Kn iterations of it. In doing so, we approximate the exact 
projection onto C by the projection pK^ onto Cx^-i. The 
resulting error is = Pc (po) —PK„ - According to Theoremd 
this error must be controlled so as to yield overall a summable 
process. First, since Pc is nonexpansive 0 Proposition 4.16], 
we have 


WPciPo) - PciPKjW ^ Wpo-PkJ ^ 0. (28) 

Now suppose that pipKn) > V (otherwise we are done). By 
convexity, p is Lipschitz-continuous relative to compact sets 
0 Corollary 8.41]. Therefore there exists C > 0 such that 0 < 

TiPKj-p = PiPKj-piPciPo)) < C\\PK,,-PciPo)\\ 0. 

In addition, assuming that int(C') f 0, using standard error 
bounds on convex inequalities ll26l . there exists a constant 
f > 0 such that 

Ibxn - PciPKjW < ^{piPKj -v) ^0- (29) 

Thus, 

Ibnll = \\PciPo) - PkJ 

< \\PciPo) -PciPKjW + \\Pc{PKn) -PkJ\ 
^\\Po-PKj\-G^{p{PK„)-r])- (30) 

Thus, is suffices to take Kn large enough so that, for instance, 
we have [po - PkJ\ < and pipK^) - V ^ 

for some > 0, ^2 > 0, and e > 0. This will guarantee that 
SngN ll®"ll ^ therefore, by Theorem[T] the conver¬ 
gence of the sequence generated by the following 

algorithm to a solution to Problem 

for n = 0,1,... 

Vn=Wn- 7„V$('u;„) 

Po = Xn 

for fc = 0,1,..., Kn — 1 

Ck =T]-pipk) 

ifCfc^O (31) 

[terminate. 

Sk G dpipk) 

Pk+l/2 =Pk+ CkSk/WSkW^ 

_ Pk+l = QiP 0 ,Pk,Pk+l/ 2 ) 

_ Wn+1 = PK„- 

Let us observe that, from a practical standpoint, we have found 
the above error analysis not to be required in our experiments 
since an almost exact projection is actually obtainable with 
a few iterations of (IZTi l. For instance, numerical simulations 
(see Fig.|2]l on the synthetic data set described in Section ITV-AI 
show that dZTl ) yields in about Kn ~ 7 iterations a point 
very close to the exact projection of po onto C- Note that the 
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Synthetic dataset 



Fig. 2: Convergence of the projection loop (7 iterations). 


by (|6ll. 

for n = 0,1,... 

7„ e [£, (2 - £)/I3] 

m 

Vn=W„- I Wn))x^ 

Po = Vn 

for A: = 0,1,, Kn — 1 

Ck = v- v{pk) 

ifCfc ^0 
[terminate. 

Sfc € d(p{pk) 

Pfe+l/2 =Pfe + CfeSfe/||Sfe|P 
Xk = (po -Pk\Pk- Pk+ 1 / 2 ) 

Pk = Ibo -Pfcll^ 

Vk = \\Pk -Pfe+l/2|p 
Pk = PkVk - Xk 
if Pfe = 0 and Xfc ^ 0 
\Pk+l = Pfe+l/2 


number of iterations of (|27] | does not depend on the dimension 
d. 


if Pfe > 0 and XkVk > Pfe 
Pfe+l = Po + ( 1 + — ) (pfe+l/2 - Pfe) 


if Pfe > 0 and XkVk < Pk 
Pfe+i =Pfe + ^ (xfc (po -Pfe) +Pfe (Pfe+i /2 -Pfc)) 
rcn+i =Pk„- 

(32) 


A subgradient of <pi at S is s = (sign(^i))i^i!£d> 

where 

{ 1 if C > 0 

0 if ^ = 0 (33) 

-1 if ^ < 0. 


Remark 1 (multiple constraints) We have presented above 
the case of a single constraint, since it is the setting em¬ 
ployed in subsequent sections. Flowever, the results of m 
Section 6.5] enable us to extend this approach to problems 
with p constraints, see Appendix A. 


The ith component of a subgradient of (p 2 at C 

is given by 

y- fsign(C*) ifbil^b/l ^. 34 ^ 

, _ 1 0 otherwise. 

(+/)eS t 

The ith component of a subgradient of pa at G R'^ 

is given by 


E 

(+i)eS 


sign(6 

0 


) if b ij 

otherwise. 


(35) 


C. Application to Problem\I\ 


It follows from 0, (l26l) . and (IZTl) . that OTb for the classi¬ 
fication problem can be written explicitly as follows, where e 
is an arbitrarily small number in ]0,1[ and where (3 is given 


D. Application to regression 

A common approach in regression is to learn w G R'^ by 
employing the quadratic loss 

I m 

- V|(a;i|w)-y/ (36) 

2m ' ' 

2=1 

instead of the function T* of ( fTSl l in Problem |2l Since rp is 
convex and has a Lipschitz-continuous gradient with constant 
/3 = crj, where tJi is the largest singular value of the matrix 
bil • • • \xm], it suffices to change the definition of Vn in 
by 

m 

Vn=Wn-—y^({Xi\Wn)-yi)xi. (37) 

2=1 
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IV. Experimental evaluation 

We illustrate the performance of the proposed constrained 
splitting method on both synthetic and real data sets. 


A. Synthetic data set 

We first simulate a simple regulatory network in genomic 
described in Ell. A genomic network is composed of regu¬ 
lators (transcription factors, cytokines, kinase, growth factors, 
etc.) and the genes they regulate. Our notation is as follows: 

• m: number of samples. 

• iVreg: number of regulators. 

• Ng\ number of genes per regulator. 

. d = 7V,eg(A^g + 1). 

The entry of the matrix X = [xi \ ■ ■ ■ \xm]^, composed of 
m rows and d columns, is as follows. 

(i) The rth regulator of the ith sample is 

Ci.reg^ ^ 2 ,A/'g(r—l)+r ^i,r{Ng + l)—Ng ^ A^(0, 1). 

This defines for j of the form r{Ng + 1) — Ng. 

(ii) The genes associated with reg have a joint bivariate 
normal distribution with a correlation of p = 0.7 

^^^(Afg + l) — Afg+fe ^ ^i^reg^ I 1 & }■ 

This defines ^ ^(iVg + 1) — Ng. 

The regression response Y is given by Y = Xw + e, where 
e ^ AfiO, a^) with a = 2. 


Example 1 In this example, we consider that 9 genes reg¬ 
ulated by the same regulators are activated and 1 gene is 
inhibited. The true regressor is defined as 


Example 3 This example is similar fo Example [T] but we 
consider that 7 genes regulated by the same regulators are 
activated and 3 genes are inhibited. The true regressor is 


w = 







B. Breast cancer data set 

We use the breast cancer data set which consists of 
gene expression data for 8,141 genes in 295 breast cancer 
tumors (78 metastatic and 217 non-metastatic). In the time 
comparison evaluation, we select a subset of the 8141 genes 
(range 3000 to 7000) using a threshold on the mean of the 
genes. We use the network provided in |[8l with p — 639 
pathways as graph constraints in our classifier. In biological 
applications, pathways are genes grouped according to their 
biological functions [0, lIZTil . Two genes are connected if they 
belong to the same pathway. Let Si be the subset of genes 
that are connected to gene i. In this case, we have a subset of 
only 40,000 connected genes in S;. Note that we compute the 
subgradient (l34] | only on the subset S; of connected genes. 


C. Comparison between penalty method and our con¬ 
strained method for classification 

First, we compare with the penalty approach using glmnet 
MATLAB software lISTl on the breast cancer data set described 
in Section IIV-BI We tuned the number of path iterations n\ 
for glmnet for different values of the feature dimension. The 
number of nonzero coefficients ||iy||o increases with n\. The 
glmnet method requires typically 200 path iterations or more 
(see Eig. 0. 



Example 2 We consider that 8 genes regulated by the same 
regulators are activated and 2 genes are inhibited. The true 
regressor is defined as 


Breast Cancer dataset 




8 2 8 2 


Fig. 3: Glmnet: Number of nonzero coefficients as a function 
of nx- 

Our classification implementation uses the logistic loss. Let 
Ills'll 1 ^ V be the surrogate sparsity constraint. Fig. |4] shows 
for different values of the feature dimension that the number 
of nonzero coefficients ||i(;||o decreases monotonically with the 
number of iterations. Consequently, the sweep search over rj 
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TABLE I: Time comparison (Matlab and mex) versus glmnet TABLE II: Time comparison(s) with projection onto the ball 
on . IfTSlI for dimension d = 3022 using Matlab. 



mex-sparse 

mex 

Matlab 

Matlab-sparse 

mex 1311 

Time(s) 

0.0230 

0.0559 

0.169 

0.0729 

0.198 



Matlab 

Matlab sparse 

ball 115J 

Time (s) 

0.169 

0.0729 

0.149 


Breast Cancer 



consists in stopping the iterations of the algorithm when ||w||o 
reaches value specified a priori. 


Breast Cancer data set 



Fig. 4: Number of nonzero coefficients as a function of the 
number of iterations. 

Our direct constrained strategy does not require the often 
heuristic search for meaningful and interpretable Lagrange 
multipliers. Moreover, we can improve processing time using 
sparse computing. Namely, at each iteration we compute scalar 
products using only the sparse suh-vector of nonzero values. 
We compare time processing using the hreast cancer data set 
(n = 295 samples, d = 3022) described in Section IIV-BI 
We provide time comparison using a 2.5 GHz Macbook Pro 
with an i7 processor and Matlah software. We report time 
processing in Table |T] using glmnet software ED and our 
method using either Matlah or a standard mex file. 
Moreover, since the vector w is sparse, we provide mex-sparse 
and matlh-sparse times using sparse computing. Fig. [5] shows 
that our constrained method is ten times faster than glmnet 
on . A numerical experiment is available in ID. 

A potentially competitive alternative constrained opti¬ 
mization algorithms for solving the projection part of our 
constrained classification splitting algorithm is that described 
in na. We plug the specific projection onto the hall algorithm 
into our splitting algorithm. We provide time comparison (in 


TABLE III: Breast cancer AUC comparisons. 



glmnet 1311 

Group Lasso f241 

¥71 

^2 

AUC (%) 

64.5 

66.7 

71.3 

72.3 


seconds) in Table HIl for classification for the breast cancer data 
set (d = 3022) described in Section HV-BI Note that the most 
expensive part of our algorithm in terms of computation is the 
evaluation of the gradient. Although the projection onto the 
hall ifTSll is faster than our projection, our method is basically 
12% slower than the specific ipi constrained method for 
dimension d = 3022. However, our sparse implementation of 
scalar products is twice as fast. Moreover, since the complexity 
of our method relies on the computation of scalar products, 
it can be easily speed up using multicore CPU or Graphics 
Processing Unit (GPU) devices, while the speed up of the 
projection on the ball IfTSl using CPU or GPU is currently an 
open issue. In addition our method is more flexible since it 
can take into account more sophisticated constraints such as 
‘P 2 , ‘Ps, or any convex constraint. We evaluate classification 
performance using area under the ROC curve (AUC). The 
result of Table nni show that our ipi constrained method 
outperforms the (pi penalty method by 5.8%. Our (p 2 constraint 
improves slightly the AUC by 1% over the (pi constrained 
method. We also observe a significant improvement of our 
constrained ip 2 method over the penalty group Lasso approach 
discussed in l24|- In addition, the main benefit of the (p 2 
constraint is to provide a set of connected genes which is 
more relevant for biological analysis than the individual genes 
selected by the (pi constraint. 

D. Comparison of various constraints for regression 

In biological applications, gene activation or inhibition are 
well known and summarized in the ingenuity pathway analysis 
(IPA) database ID. We introduce this biological a priori 
knowledge by replacing the (p^ constraint by 

(fi^: w E \uj^ - QijOjjl, (38) 

(hi)es 

where Uij = 1 if genes i and j are both activated or inhibited, 
and Qij = — 1 if gene i is activated and gene j inhibited. We 
compare the estimation of w for Example[2using ipi versus the 
Lp 2 and ipi constraint. For each fold, we estimate the regression 
vector w on 100 training samples. Then we evaluate on new 
100 testing samples. We evaluate regression using the mean 
square error (MSE) in the training set and the predictive mean 
square error (PMSE) in the test set. We use randomly half of 
the data for training and half for testing, and then we average 
the accuracy over 50 random folds. 

We show in Fig.[6a|the true regression vector and, in Fig.lbbl 
the estimation using the tpi constraint for Example[3] In Fig.IbB 
we show the results of the estimation with the if 2 constraint. 
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Fig. 6: Example [2 (a): True vector w. (b): Estimation with the ipi constraint, (c): Estimation with the (p 2 constraint, (d): 
Estimation with the (p 4 constraint. 



Fig. 7; ipi constraint for Examples [T] |2l andO Mean square 
error as a function of the parameter ry. 


and in Eig. [6d] with the ip 4 constraint. We provide for the 
three examples the mean square error as a function of rj for 
ipi (Eig. [7]l, (p 2 (Fig. [8]), and ip 4 (Fig. [9]l. We report for 
Example |2] in Eig. [TO] the estimation of the mean square error 
in the training set as a function of the number of training 
samples for the ipi, (p 2 , and (p 4 constraint. The ip 4 constraint 
outperforms both the (p 2 and the (pi constrained method. 
However, the selection of the parameter p for constraint p 4 
is more challenging. 



Parameter rj 

Fig. 8: p 2 constraint for Examples [T] |2l and [3 Mean square 
error as a function of the parameter ry. 


V. Conclusion 

We have used constrained optimization approaches to pro¬ 
mote sparsity and feature grouping in classification and regres¬ 
sion problems. To solve these problems, we have proposed a 
new efficient algorithm which alternates a gradient step on 
the data fidelity term and an approximate projection step onto 
the constraint set. We have also discussed the generalization 
to multiple constraints. Experiments on both synthetic and 
biological data show that our constrained approach outper¬ 
forms penalty methods. Moreover, the formulation using the 
P 4 constraint outperforms those using the pairwise p 2 and the 
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Parameter rj 

Fig. 9 : <p 4 constraint for Examples [T] |2l and 0 Mean square 
error as a function of the parameter rj. 


Thenpfe+i = Qipo,Pk,Pk+i/ 2 ) Pc{Po) COl Theorem 6.4] 
and therefore the generalization 

for n = 0,1,... 

Vn=Wn- 7nV4>(r(;„) 

Po = Vn 

for fc = 0,1,..., Kn — 1 

Cfc = mini^j^p [pj - Pj{pk)) 

ifCfe^O (A5) 

[terminate, 
for j = 1,... ,p 
[sj.fc e dpjipk) 

compute p/j_|_ 4/2 in (IA2ll - (IA4b 

_ Wn+l ^PK^- 

of (OTb solves dAll) . 



Number of samples 

Fig. 10: MSE as a function of the number of samples m for 
Example |2l 


pi constraint. 

Appendix A - The case oe multiple constraints 
Let $ be as in Problem [2] and, for every j £ {1,... ,p}, let 
ipj : —>■ K be convex, let pj £ R, and let ojj £ ]0,1] be 

such that — 1- Consider the problem 

minimize 4>(ti;). (Al) 


ipp{w)^rjp 

In other words, C = n^=i {w C | ‘Pjiw) < pj} in (fT^ . 
Let k G N. For every j £ {1,... ,p}, let Sj^k C 9(pj{pk) and 
set 


Pj,k — 

Now define 


Vj-Pj{Pk) 

Pk II ||2 if ‘PjiPk) > Pj 




Pk 


if Pi(Pfe) ^ Vj- 


(A2) 


Pfc+i/2 = Pfc + Tfc ^jPj,k - Pkj , (A3) 


where 


Lk 


p 

'^(^j\\pj,k-Pk\\'^ 

j=i 


^ ^ ^jPj,k Pk 


(A4) 
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